RNA-seq data in 3 immune cells of 4 donors
STAR was run using the following options:
| Metric | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| Total input, million reads | 30.74 | 32.7000 | 34.480 | 36.3400 | 40.3900 | 44.52 |
| Alignment rate (%), unique mapping | 80.45 | 82.9800 | 83.600 | 84.1800 | 85.3400 | 88.79 |
| Alignment rate (%), unique + multiple | 92.64 | 93.3500 | 93.640 | 93.6200 | 93.8700 | 94.52 |
| Mismatch rate (%) | 0.27 | 0.2800 | 0.295 | 0.2992 | 0.3200 | 0.33 |
| Deletion rate (%) | 0.02 | 0.0200 | 0.020 | 0.0200 | 0.0200 | 0.02 |
| Insertion rate (%) | 0.01 | 0.0100 | 0.010 | 0.0100 | 0.0100 | 0.01 |
| Too many loci (%) | 0.05 | 0.0675 | 0.110 | 0.1033 | 0.1325 | 0.15 |
| Too many mismatch (%) | 0.09 | 0.1000 | 0.110 | 0.1125 | 0.1200 | 0.14 |
| Too short (%) | 5.23 | 5.7850 | 6.055 | 6.0860 | 6.4050 | 7.10 |
In most RNA-seq data sets, the percentage of total input reads that can be aligned to the reference genome/transcriptome ranges between 50% and 90%. Alignment rate is an important quality index of both the RNA-seq library and the high-throughput sequencing run. However, it also depends heavily on the experimental material and protocol, so it is hard to define a single cutoff for a “high” alignment rate that applies to all data sets. The consistency of alignment rates between samples of the same data set is at least equally important. Inconsistent alignment rates are usually the consequence of systematic bias introduced somewhere in the experimental procedure. Such bias adds unwanted between-sample variance to the data and might have a profound impact on statistical analyses, such as differential gene expression. Therefore, the focus of this analysis is whether any libraries have much lower alignment rates than the others.
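One way to make "much lower than the others" concrete is a robust outlier rule based on the median and the median absolute deviation (MAD). The sketch below is only an illustration, not the method used to produce this report: the function name `flag_low_alignment`, the threshold `k = 3`, and the library names and rates in the usage lines are all hypothetical.

```python
from statistics import median


def flag_low_alignment(rates, k=3.0):
    """Return libraries whose alignment rate lies more than k robust
    standard deviations below the median of all libraries.

    rates: dict mapping library name -> alignment rate in percent.
    k:     assumed threshold; 3 is a common but arbitrary choice.
    """
    values = list(rates.values())
    med = median(values)
    # MAD scaled by 1.4826 approximates the standard deviation
    # when the data are roughly normal.
    mad = median([abs(v - med) for v in values])
    scale = 1.4826 * mad
    if scale == 0:
        # All rates are (nearly) identical; nothing can be flagged.
        return []
    return [name for name, v in rates.items() if (med - v) / scale > k]


# Hypothetical rates, not taken from this data set.
example = {"lib01": 83.6, "lib02": 84.1, "lib03": 82.9,
           "lib04": 85.3, "lib05": 83.2, "lib06": 70.0}
low = flag_low_alignment(example)
```

Only deviations below the median are flagged, because a library with an unusually high alignment rate is not the concern here. With a small number of libraries the MAD is itself noisy, so flagged libraries are best treated as candidates for closer inspection rather than as automatic exclusions.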
The ratio of unique to multiple alignments is a similar index of data quality. A high percentage of multiple alignments might indicate low sequence complexity of the reads, a higher sequencing error rate, or other issues. This analysis also evaluates the consistency of unique vs. multiple alignment between samples.
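As a rough illustration of how the unique vs. multiple split could be computed per library, the sketch below reads the `key | value` lines of a STAR `Log.final.out` summary and reports the share of aligned reads that map to multiple loci. The two field labels are assumed to match the STAR version in use and should be checked against an actual log; the helper names and the numbers in the usage lines are hypothetical.

```python
def parse_star_log(text):
    """Parse 'key | value' lines from a STAR Log.final.out-style summary
    into a dict of stripped strings. Lines without '|' are skipped."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '83.60%' to the float 83.6."""
    return float(stats[key].rstrip("%"))


def multi_mapping_share(stats):
    """Fraction of aligned reads (unique + multiple) that map to
    multiple loci. Field labels are assumed, see the note above."""
    unique = percent(stats, "Uniquely mapped reads %")
    multi = percent(stats, "% of reads mapped to multiple loci")
    return multi / (unique + multi)


# Hypothetical log excerpt, not taken from this data set.
example_log = (
    "        Uniquely mapped reads % |\t80.00%\n"
    "  % of reads mapped to multiple loci |\t20.00%\n"
)
share = multi_mapping_share(parse_star_log(example_log))
```

Computing this share for every library and comparing the values across libraries gives a direct check of the consistency described above.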
An important aspect of processing RNA-seq data is aligning sequence reads across splice sites, called gapped alignment. Most commonly, STAR performs gapped alignment first by using the known splice sites from the reference transcriptome and then by detecting novel splice sites based on the reference genome. Most splice sites have canonical donor/acceptor bases, such as GT/AG. While non-canonical splice sites have been observed, they are relatively rare and often suggestive of false positives.
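The proportion of non-canonical junctions can be tallied from STAR's `SJ.out.tab` output. The following is a minimal sketch that assumes the tab-separated layout described in the STAR manual, where the fifth column encodes the intron motif (0 for non-canonical, 1 to 6 for the canonical motifs on either strand). The function names and the junction records in the usage lines are hypothetical.

```python
from collections import Counter

# Intron motif codes in column 5 of SJ.out.tab (assumed layout).
MOTIFS = {
    0: "non-canonical",
    1: "GT/AG", 2: "CT/AC",
    3: "GC/AG", 4: "CT/GC",
    5: "AT/AC", 6: "GT/AT",
}


def tally_motifs(lines):
    """Count junctions per intron motif from SJ.out.tab-style lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        counts[MOTIFS.get(int(fields[4]), "unknown")] += 1
    return counts


def noncanonical_fraction(counts):
    """Fraction of all counted junctions with a non-canonical motif."""
    total = sum(counts.values())
    return counts["non-canonical"] / total if total else 0.0


# Hypothetical junction records, not taken from this data set.
example_lines = [
    "chr1\t100\t200\t1\t1\t1\t10\t0\t30",
    "chr1\t300\t400\t2\t2\t1\t5\t0\t25",
    "chr1\t500\t600\t0\t0\t0\t1\t0\t12",
    "chr1\t700\t800\t1\t1\t0\t3\t1\t20",
]
counts = tally_motifs(example_lines)
fraction = noncanonical_fraction(counts)
```

This counts each junction once regardless of read support. Weighting by the number of uniquely mapped reads per junction (column 7 in the assumed layout) would be a reasonable alternative, since weakly supported junctions are more likely to be false positives.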